Update 1. Profiling Hyperbolic Models


Author: Tiberiu Iancu
Date: 01.07.2025

Introduction

This analysis looks into the performance of hyperbolic deep learning models from a compute perspective, in order to identify inefficiencies and bottlenecks in the implementation.
The mathematical operations themselves, as well as potential numerical optimizations are out of scope.



Setup and methodology

System: I ran the experiments on my own PC with Ubuntu 24.04, an RTX 2070 GPU with 8GB VRAM on CUDA 12.8, and an i7-6800K CPU.

Workload: I chose to profile a simple 2-layer MLP, as well as a ResNet (architecture details provided in the ResNet section). The model size and the batch size are chosen so as to maximize GPU occupancy (i.e., time spent doing computation relative to kernel launch overhead). For both I implemented a Euclidean as well as a hyperbolic version, then ran a short training of 2 iterations: one iteration for warmup, and one iteration recorded with the built-in torch profiler, which captures CPU and GPU traces, as well as memory usage. Each iteration consists of a simple forward + backward pass (and an Adam optimizer step). A third configuration is further tested for the hyperbolic network: the model is first compiled with torch.compile, which generates Triton kernels that fuse consecutive element-wise operations together. In simpler terms, compilation speeds up execution by keeping intermediate results close to the GPU (encouraging reuse of L2 cache) instead of round-tripping every operation through global memory.
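The warmup + record methodology above can be sketched as follows. The model, sizes, and loss here are illustrative stand-ins, not the exact architectures profiled in this writeup:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative stand-in model; the real experiments use the MLP / ResNet
# described in the following sections.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
x = torch.randn(32, 64)
y = torch.randint(0, 10, (32,))

def train_step():
    optimizer.zero_grad()
    loss = criterion(model(x), y)  # forward pass + criterion
    loss.backward()                # backward pass
    optimizer.step()               # Adam update

train_step()  # warmup iteration (not recorded)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, profile_memory=True) as prof:
    train_step()  # recorded iteration

# The exported chrome trace can be opened in chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The traces shown throughout this writeup are visualizations of exactly this kind of export.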

Dataset: I used Caltech256 (224x224 inputs, 256 classes), as I wasn't able to get access to ImageNet. I initially tried CIFAR10, but its low resolution resulted in low GPU occupancy, so I opted for the more realistic scenario of higher-resolution images.


MLP

Euclidean MLP

For reference, we present a trace captured from the training of the Euclidean MLP. The figure below displays the different stages of the training: transforming and loading data into GPU memory, forward pass (+criterion computation), backward pass, and optimizer step.
Writeup1_mlp_euc.drawio (2).png
Two things stand out: first, the GPU is continuously busy, and does not wait for the CPU to queue operations. Second, the bulk of the execution time is spent on matrix multiplications.

Hyperbolic MLP

For simplicity, from now on we'll only inspect GPU traces. Below you can see a GPU trace from the forward pass.
Writeup1_mlp_hyp.png
Compared to the Euclidean MLP, the hyperbolic model has additional computation to do. First, the tensors must be moved onto the manifold; this operation can likely be optimized. Second, the forward pass now consists of two compute-intensive operations: the Euclidean norm of the weights, as well as the usual matrix multiplication between input and weights. This layer could benefit from some optimization, and as we'll see later, torch.compile can help here.
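For context, "moving tensors to the manifold" typically means applying the exponential map at the origin of the Poincaré ball. A minimal sketch, assuming unit curvature and the standard closed-form expression (the profiled library may use a different manifold or parametrization):

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-8) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball (curvature -c).

    Maps a Euclidean (tangent) vector onto the manifold:
        expmap0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)
    """
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

x = torch.randn(8, 16)
on_ball = expmap0(x)
# tanh < 1, so every point lands strictly inside the unit ball
assert (on_ball.norm(dim=-1) < 1.0).all()
```

Note the chain of norm, tanh, division, and multiplication: each is a separate element-wise kernel in eager mode, which is exactly the fusion opportunity torch.compile exploits.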
The backward pass looks even less optimized. In this case, the two matrix multiplications take up only a small portion of the total computation, so we would expect a large performance gain from fusing operations together.
Writeup1_mlp_hyp_backward.png

Hyperbolic MLP + compile

When compiling the model, performance improves significantly. The forward pass is now computed in 1.3ms, compared to 4.1ms in the non-compiled version. Torch seems to have fused most element-wise operations together, but more importantly, it optimized the kernel launch parameters, leading to a significant performance improvement in both the Euclidean norm and the matrix multiplication.
Pasted image 20250626151627.png
The compiled backward pass also improved by a factor of 4, from 28ms to only 7ms, mostly attributable to the fusing of the many element-wise operations:
Pasted image 20250626152122.png
Compilation also drastically improves the performance of the optimizer, reducing the execution of one step from 50ms to 16.7ms.

The execution footprint of one iteration is summarized in the table below:

| Model                | Move to manifold | Forward | Backward | Optimizer | Total  | Peak memory (GB) |
| -------------------- | ---------------- | ------- | -------- | --------- | ------ | ---------------- |
| Euclidean            | -                | 1ms     | 1.5ms    | 9ms       | 11.5ms | 1.1              |
| Hyperbolic           | 1ms              | 4ms     | 28ms     | 50ms      | 83ms   | 1.4              |
| Hyperbolic + compile | 1ms              | 1.3ms   | 7ms      | 16.7ms    | 26ms   | 1.4              |

ResNet

Euclidean ResNet

For reference, we first inspect the GPU trace of a ResNet18 with batch size 8. Even on such a small architecture the GPU is mostly occupied executing convolution kernels. Here, the forward pass duration is 20ms, and the backward pass takes approximately 19ms.
Pasted image 20250701121612.png

Hyperbolic ResNet

Sadly, the hyperbolic network's computational graph is far too large, and analyzing such large traces is not feasible. Furthermore, I was unable to torch.compile larger hyperbolic ResNets. I therefore chose to profile a much smaller ResNet architecture with only ~1500 parameters and batch size 2.
Even such a small model takes an unreasonable amount of time for the forward pass: a staggering 2.25 seconds. Of this, only 150ms is spent in the ResNet blocks, while over 2 seconds is spent computing the Fréchet mean.
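To see why the Fréchet mean is so expensive, recall that it has no closed form: it is the minimizer of the sum of squared geodesic distances and must be found iteratively. The sketch below uses plain autograd descent on the Poincaré ball; real implementations use proper Riemannian updates, but the structure (a long loop of small element-wise kernels) is the same, and that is what the trace shows:

```python
import torch

def poincare_dist(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # geodesic distance on the unit Poincare ball
    sq = ((x - y) ** 2).sum(-1)
    denom = ((1 - (x ** 2).sum(-1)) * (1 - (y ** 2).sum(-1))).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom)

def frechet_mean(pts: torch.Tensor, steps: int = 100, lr: float = 1e-2) -> torch.Tensor:
    # Minimize the sum of squared geodesic distances by iterative descent.
    # This is a sketch, not the library's exact algorithm, but it shows why
    # the operation spawns hundreds of tiny kernels.
    mu = pts.mean(0).clone().requires_grad_(True)
    opt = torch.optim.Adam([mu], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (poincare_dist(mu, pts) ** 2).sum().backward()
        opt.step()
        with torch.no_grad():
            mu.clamp_(-0.6, 0.6)  # crude projection back inside the ball
    return mu.detach()

pts = torch.rand(16, 2) * 0.6 - 0.3  # points well inside the ball
mu = frechet_mean(pts)
assert mu.norm() < 1.0
```

Every iteration of that loop is launch-bound on the GPU, which explains the 2-second wall-clock time at this tiny scale.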
Writeup1_resnet_hyp_fwd.drawio 1.png
When inspecting the ResNet blocks themselves, we identify a second bottleneck: hyperbolic batch norm. Even when avoiding the Fréchet mean by using the midpoint optimization, batch norm is still the most time-consuming layer. Kernel fusion can likely improve performance greatly.
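The midpoint optimization mentioned above typically replaces the Fréchet mean with the Einstein midpoint, a closed-form Lorentz-factor-weighted average computed in the Klein model (assuming that is what the profiled implementation does). A minimal sketch:

```python
import torch

def einstein_midpoint(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # x: (N, d) points in the Klein model (norm < 1).
    # Each point is weighted by its Lorentz factor gamma = 1 / sqrt(1 - ||x||^2).
    gamma = 1.0 / torch.sqrt((1 - (x ** 2).sum(-1, keepdim=True)).clamp_min(eps))
    return (gamma * x).sum(0) / gamma.sum(0)

pts = torch.rand(32, 4) * 0.6 - 0.3
mid = einstein_midpoint(pts)
assert mid.norm() < 1.0  # a convex combination of ball points stays inside the ball
```

Unlike the iterative Fréchet mean, this is a fixed, short chain of element-wise ops and one reduction, which is why it is so much cheaper, and still a good fusion candidate.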
Zooming in on the GPU stream in the conv layer, we see that very little time is spent actually executing kernels. Part of the reason is the small model size, so this result should be taken with a grain of salt. However, the conv2d implementation itself suggests that the operation is inefficient: the high-level torch formulation (unfold + dense matmul + fold) makes poor use of the GPU. Optimizing this layer by rewriting the operation as a dedicated GPU kernel (e.g., using the Winograd algorithm) should greatly improve throughput.
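The unfold-based path can be reproduced in a few lines (shown here for a plain Euclidean convolution; the hyperbolic layer wraps additional manifold operations around it). It produces the same result as the fused F.conv2d kernel, but materializes a large (N, C_in·kH·kW, H·W) intermediate tensor, which is where the inefficiency comes from:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8, 8)   # (N, C_in, H, W)
w = torch.randn(4, 3, 3, 3)   # (C_out, C_in, kH, kW)

# im2col-style convolution: extract patches, then one dense matmul
cols = F.unfold(x, kernel_size=3, padding=1)   # (N, C_in*kH*kW, H*W)
out = w.view(4, -1) @ cols                     # (N, C_out, H*W)
out = out.view(2, 4, 8, 8)

# matches the dedicated (cuDNN-backed on GPU) convolution kernel
ref = F.conv2d(x, w, padding=1)
assert torch.allclose(out, ref, atol=1e-4)
```

A dedicated kernel avoids ever writing the patch matrix to memory, which is the gain a hand-written (e.g., Winograd) implementation would capture.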
I'll leave the backward pass out, as it doesn't provide any additional insight.

Hyperbolic ResNet + compile

Compiling the model provides the biggest improvement we've seen so far: the Fréchet mean operation drops from over 2 seconds to 4ms! This appears to be due to the fusion of many element-wise operations.
Pasted image 20250701124953.png
The runtime of the ResNet blocks was reduced from 150ms to 60ms; we can no longer inspect the performance of specific layers in the trace, as these have been fused together. Again, GPU utilization is very low due to the small model size, so we leave it out of the analysis.
The compiled ResNet peaked at 0.04GB memory used, while the uncompiled ResNet peaked at 0.1GB.


Conclusion

In this analysis we've seen that hyperbolic networks suffer from severe computational bottlenecks, mainly due to the high-level Torch implementation. Compiling the models goes a long way in optimizing operations. However, compiling does not seem to be the definitive solution to our problems: it acts rather like a band-aid on the inefficiency wound. There are three main drawbacks that come with compiling. First, the large overhead (even compiling the small ResNet takes a few minutes) slows down development. Second, compilation of larger models seems to be unstable and does not always succeed. Third, compilation results cannot be saved, meaning re-compilation must be performed on every run.
I believe that moving forward, many of the hyperbolic operations need their own GPU kernel in order to make ResNet viable at scale. The main candidates are: feed forward, conv, batch norm, and average pooling (frechet mean). From these, we've seen that compilation is fast and painless for the feed forward layer and produces very good results. Similarly, compiling the frechet mean operation is highly effective, and having to do this by hand would likely be extremely time-consuming (judging by the number of fused kernels in the GPU trace). This leaves the convolutional block and the batch norm. I believe these can benefit most from manual optimization, as internally they perform aggregations or more complex data manipulation.